
feat: Add 100% Spark-compatible regex support via codegen dispatcher #4239

Open
andygrove wants to merge 65 commits into apache:main from andygrove:java-regexp


@andygrove
Member

@andygrove andygrove commented May 6, 2026

Which issue does this PR close?

Part of the simplification discussed in #4310.

Rationale for this change

Add support for all Spark regex expressions (rlike, regexp_extract, regexp_extract_all, regexp_instr, regexp_replace, split) with full java.util.regex compatibility by routing them through the Arrow-direct codegen dispatcher introduced in #4417. The dispatcher Janino-compiles Spark's own doGenCode for the expression, so the regex family inherits Spark-identical semantics with no per-expression glue code.

The native Rust regex engine is potentially faster but cannot fully match Java regex semantics (backreferences, lookaround, embedded flags, etc.). Rather than expose users to two orthogonal axes (engine choice plus a per-expression allowIncompatible flag), this PR collapses to a single engine selector.
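To make the semantic gap concrete, here is a small illustration of three `java.util.regex` features named above: a backreference, a lookahead, and an embedded flag. The class and method names are made up for this example and are not part of Comet. The Rust `regex` crate documents that it does not support backreferences or look-around, which is why patterns like the first two cannot run on the native engine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaRegexFeatures {
    // Backreference: \1 must repeat exactly what group 1 captured.
    public static boolean hasRepeatedWord(String s) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(s).find();
    }

    // Lookahead: match digits only when "px" follows, without consuming "px".
    public static String firstPixelValue(String s) {
        Matcher m = Pattern.compile("\\d+(?=px)").matcher(s);
        return m.find() ? m.group() : null;
    }

    // Embedded flag: (?i) enables case-insensitive matching from inside the pattern.
    public static boolean mentionsComet(String s) {
        return Pattern.compile("(?i)comet").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedWord("the the cat"));   // true
        System.out.println(firstPixelValue("width: 120px"));  // 120
        System.out.println(mentionsComet("Apache COMET"));    // true
    }
}
```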

Configs

  • spark.comet.exec.regexp.engine in {rust, java}, default java

    • java: route the regex expression through the codegen dispatcher so Spark's own doGenCode (backed by java.util.regex.Pattern) runs inside the Comet pipeline for full Spark semantics. Requires spark.comet.exec.scalaUDF.codegen.enabled (also default true); falls back to Spark with an explanatory message when that flag is disabled.
    • rust: run the native DataFusion regex engine when an implementation exists. Setting this is itself the opt-in for the semantic differences from Java regex: no separate allowIncompatible flag needed. Expressions without a native implementation (regexp_extract, regexp_extract_all, regexp_instr) fall through to the JVM codegen dispatcher so users still get Comet acceleration with full Spark semantics.
  • spark.comet.exec.scalaUDF.codegen.enabled now defaults to true (was false). With pure defaults, the regex family runs on the Comet path with Spark-identical semantics, and the DateFormatClass dispatcher path is similarly active. Setting the flag to false reverts to Spark fallback for paths that depend on the dispatcher.

  • Per-expression disable: each regex expression has a spark.comet.expression.<ClassName>.enabled flag (default true) that disables Comet's serde for that expression. Useful for narrowing a regression or comparing performance on a single operator without touching the engine selector.

What changes are included in this PR?

  • Add a RegexpRoute helper in strings.scala that each regex serde delegates to. It picks between the native Rust engine, the codegen dispatcher, and Spark fallback based on engine and scalaUDF.codegen.enabled. Under engine=rust, expressions with no native path fall through to the dispatcher rather than to Spark.
  • For expressions with no native Rust path (regexp_extract, regexp_extract_all, regexp_instr), introduce a CometRegexpCodegenOnly base class. Each serde is a one-line subclass.
  • For expressions with a native path (rlike, regexp_replace, split), the JVM arm delegates to CometScalaUDF.emitJvmCodegenDispatch. The native arm is unchanged.
  • Native serdes surface as Incompatible(notes, optedInBy="...engine=rust") so the standard gating in QueryPlanSerde recognizes engine=rust as the opt-in via optedInBy.
  • Extend SupportLevel.Incompatible with an optedInBy: Option[String] field, plumbed through scalar- and aggregate-expression gating in QueryPlanSerde.
  • Add the spark.comet.exec.regexp.engine config in CometConf.
  • Flip the default of spark.comet.exec.scalaUDF.codegen.enabled to true and drop "experimental" language from the regex/codegen docs and config strings.
  • Remove RegExp.isSupportedPattern (was a placeholder always returning false).
  • Document the model in docs/source/user-guide/latest/compatibility/regex.md, including the per-expression disable knobs.
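The routing rules described above can be summarized as a small decision function. This is a sketch of the decision table only, written in Java for illustration; the actual `RegexpRoute` helper lives in `strings.scala`, and the names `Engine`, `Route`, and `choose` below are hypothetical.

```java
public class RegexpRouteSketch {
    public enum Engine { RUST, JAVA }
    public enum Route { NATIVE_RUST, CODEGEN_DISPATCHER, SPARK_FALLBACK }

    // engine:         value of spark.comet.exec.regexp.engine
    // hasNativeImpl:  true for rlike, regexp_replace, split; false for
    //                 regexp_extract, regexp_extract_all, regexp_instr
    // codegenEnabled: value of spark.comet.exec.scalaUDF.codegen.enabled
    public static Route choose(Engine engine, boolean hasNativeImpl, boolean codegenEnabled) {
        // engine=rust uses the native path whenever one exists.
        if (engine == Engine.RUST && hasNativeImpl) {
            return Route.NATIVE_RUST;
        }
        // engine=java, or engine=rust without a native path, prefers the dispatcher.
        if (codegenEnabled) {
            return Route.CODEGEN_DISPATCHER;
        }
        // Neither the native path nor the dispatcher is available.
        return Route.SPARK_FALLBACK;
    }
}
```

Under this reading, `choose(Engine.RUST, false, true)` yields the dispatcher rather than Spark, which is the fallthrough behavior the last sentence of the first bullet describes.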

How are these changes tested?

  • CometRegExpJvmSuite: 46 tests covering all six regex expressions with engine=java and the codegen flag enabled, plus a regression test that exercises the engine=rust → JVM dispatcher fallthrough for regexp_extract, regexp_extract_all, and regexp_instr.
  • 9 SQL test files: rlike_{java,rust}.sql, regexp_replace_{java,rust}.sql, split_{java,rust}.sql, regexp_extract.sql, regexp_extract_all.sql, regexp_instr.sql.
  • CometStringExpressionSuite, CometSqlFileTestSuite, CometCodegenSuite, and CometTemporalExpressionSuite continue to pass; split tests migrated from the legacy per-class allowIncompatible flag to engine=rust.

Migration notes

  • spark.comet.exec.scalaUDF.codegen.enabled now defaults to true. With pure defaults the regex family and the DateFormatClass dispatcher path run on Comet rather than falling back to Spark. Set the flag to false to restore the old behavior.
  • The default regex engine changed from rust to java. With the dispatcher now on by default, the regex family runs with full Spark semantics out of the box.
  • Under engine=rust, regexp_extract, regexp_extract_all, and regexp_instr now fall through to the JVM codegen dispatcher instead of Spark (previously they fell back to Spark because they have no native rust path).
  • Users who previously set spark.comet.expression.regexp.allowIncompatible=true to enable the rust path should switch to spark.comet.exec.regexp.engine=rust. The per-expression flag is no longer consulted by the regex family.
  • Users who previously set spark.comet.expression.StringSplit.allowIncompatible=true should likewise switch to spark.comet.exec.regexp.engine=rust.
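For deployments that manage Spark settings programmatically, the last two migration notes could be applied with a helper along these lines. This is a hypothetical sketch, not something shipped with the PR: `migrate` is an invented name, and the choice to leave the legacy keys in place (in case other code still reads them) and to respect an already-set engine value are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RegexConfigMigration {
    static final String ENGINE_KEY = "spark.comet.exec.regexp.engine";
    static final String[] LEGACY_KEYS = {
        "spark.comet.expression.regexp.allowIncompatible",
        "spark.comet.expression.StringSplit.allowIncompatible"
    };

    // Returns a copy of conf where a legacy opt-in is translated to engine=rust.
    // Legacy keys are left untouched; an explicit engine setting always wins.
    public static Map<String, String> migrate(Map<String, String> conf) {
        Map<String, String> out = new LinkedHashMap<>(conf);
        for (String legacy : LEGACY_KEYS) {
            if ("true".equalsIgnoreCase(out.get(legacy))) {
                out.putIfAbsent(ENGINE_KEY, "rust");
            }
        }
        return out;
    }
}
```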

Also fix CometArrayExpressionSuite compilation by qualifying the Spark
udf() call, which was shadowed by the new org.apache.comet.udf package.
Implements a DataFusion PhysicalExpr that evaluates child expressions,
exports the results as Arrow FFI arrays, calls
CometUdfBridge.evaluate() via JNI, and imports the output array.
Adds datafusion-comet-jni-bridge as a dependency of the spark-expr crate.
…UDF class via context classloader

Wrap the JNI body in try/finally so input ValueVectors and the result vector
are always closed, even when the UDF or arrow export throws. Resolve the
CometUDF class through the thread context classloader so user-supplied UDF
jars (added via spark.jars) are visible from the bridge.
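The two ideas in this commit message, always closing resources and resolving classes through the thread context classloader, look roughly like the following in plain Java. This is an illustrative sketch only: it uses `AutoCloseable` as a stand-in for Arrow vectors, and `resolve` and `withResources` are hypothetical names rather than the bridge's actual methods.

```java
import java.util.List;
import java.util.concurrent.Callable;

public class BridgeCleanupSketch {
    // Prefer the thread context classloader so classes from user-supplied
    // jars are visible; fall back to this class's loader if none is set.
    public static Class<?> resolve(String className) throws ClassNotFoundException {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = BridgeCleanupSketch.class.getClassLoader();
        }
        return Class.forName(className, true, cl);
    }

    // Run body and close every input afterwards, even if body throws.
    public static <T> T withResources(List<? extends AutoCloseable> inputs, Callable<T> body)
            throws Exception {
        try {
            return body.call();
        } finally {
            for (AutoCloseable c : inputs) {
                try {
                    c.close();
                } catch (Exception ignored) {
                    // Sketch simplification: keep closing the remaining inputs.
                }
            }
        }
    }
}
```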
…ns fall back to Spark

When routing RLike through the JVM UDF, reject Literal(null) and patterns
that fail Pattern.compile during planning. Both cases now produce withInfo
+ None, letting Spark evaluate the expression instead of crashing the
executor task with PatternSyntaxException or NullPointerException.
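A planning-time check of this kind can be sketched as follows. The helper name `tryCompile` is hypothetical; an empty result stands for the "withInfo + None" outcome, where the planner declines the expression and Spark evaluates it instead.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PatternPrecheck {
    // Returns the compiled pattern, or empty when the literal is null or
    // does not compile, so the caller can fall back instead of failing later.
    public static Optional<Pattern> tryCompile(String literalPattern) {
        if (literalPattern == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Pattern.compile(literalPattern));
        } catch (PatternSyntaxException e) {
            return Optional.empty();
        }
    }
}
```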
Make comet_udf_bridge an Option in JVMClasses so a missing
org.apache.comet.udf.CometUdfBridge class (e.g. shading dropped
org.apache.comet.udf.*) no longer crashes executor JVM init. The
JVM-UDF dispatch path returns a clear ExecutionError when the bridge
is unavailable. Also clarify the FFI lifetime contract on the result
import.
Replace string literals "rust"/"java" used for the regexp engine selector
with named constants on CometConf. Tighten CometRLike.getSupportLevel so
it only reports Compatible(None) when the pattern is a Literal, matching
the actual constraint enforced by the convert path.
Literal-folded children no longer get expanded to batch-row count before
crossing JNI; ColumnarValue::Scalar is materialized at length 1, avoiding
an O(rows) copy of values that never vary across the batch. Document the
contract on CometUDF: scalar inputs arrive as length-1 vectors, vector
inputs at the batch row count, and the result must match the longest
input.
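The length contract can be illustrated with plain arrays standing in for Arrow vectors. This is a simplified sketch under stated assumptions: nulls are ignored, each input is either length 1 (a scalar) or the batch row count, and `rlike` here is a hypothetical function, not the `CometUDF` interface.

```java
import java.util.regex.Pattern;

public class ScalarInputContract {
    // subjects and patterns are each either length 1 (scalar) or length n.
    // The result has the length of the longest input.
    public static boolean[] rlike(String[] subjects, String[] patterns) {
        int n = Math.max(subjects.length, patterns.length);
        // A scalar pattern is compiled once instead of once per row.
        Pattern cached = patterns.length == 1 ? Pattern.compile(patterns[0]) : null;
        boolean[] out = new boolean[n];
        for (int i = 0; i < n; i++) {
            String s = subjects[subjects.length == 1 ? 0 : i];
            Pattern p = cached != null ? cached : Pattern.compile(patterns[i]);
            out[i] = p.matcher(s).find();
        }
        return out;
    }
}
```

Passing the literal pattern at length 1 is what avoids the O(rows) copy: the pattern is never expanded to one entry per row before it is used.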
# Conflicts:
#	spark/src/main/scala/org/apache/comet/serde/strings.scala
@andygrove
Member Author

@mbutrovich following on from our discussion about configs yesterday, I filed an issue where we can have that discussion. #4310

@andygrove andygrove moved this to In progress in Comet Development May 13, 2026
@andygrove andygrove added this to the 0.17.0 (June 2026) milestone May 13, 2026
…nature

PR apache#4306 added a numRows parameter to CometUDF.evaluate; merging main
into this branch brought in the trait change but the six regexp UDF
implementations still used the old single-argument signature, breaking
comet-common compilation across all Spark profiles.
andygrove added 2 commits May 19, 2026 08:00
…ee-pr-4239

# Conflicts:
#	spark/src/main/scala/org/apache/comet/udf/RegExpExtractAllUDF.scala
#	spark/src/main/scala/org/apache/comet/udf/RegExpExtractUDF.scala
#	spark/src/main/scala/org/apache/comet/udf/RegExpInStrUDF.scala
#	spark/src/main/scala/org/apache/comet/udf/RegExpLikeUDF.scala
#	spark/src/main/scala/org/apache/comet/udf/RegExpReplaceUDF.scala
#	spark/src/main/scala/org/apache/comet/udf/StringSplitUDF.scala
Adds a master switch (default false) for the experimental JVM UDF framework
so the Java regex engine cannot be activated without an explicit opt-in. With
engine=java but jvmUdf.enabled=false, the six regex serdes return Unsupported
with a message naming the master switch instead of silently using either path.

Also extends Incompatible with optedInBy: Option[String] so a config (e.g. an
engine selector) can serve as a per-expression incompatibility opt-in. Existing
allowIncompatible flags continue to work; optedInBy is OR'd into the gating
check in QueryPlanSerde. No existing serde uses optedInBy yet — this lays the
foundation for the config simplification discussed in apache#4310.
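One way to picture the OR'd gating check is shown below. This is a guess at the shape of the logic, not the `QueryPlanSerde` code: it assumes the `optedInBy` string is interpreted as a `key=value` pair looked up in the session config, and the class, field, and method names are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

public class OptedInByGateSketch {
    // Stand-in for SupportLevel.Incompatible(notes, optedInBy).
    public static final class Incompatible {
        final String notes;
        final Optional<String> optedInBy;

        public Incompatible(String notes, Optional<String> optedInBy) {
            this.notes = notes;
            this.optedInBy = optedInBy;
        }
    }

    // Allowed if the legacy allowIncompatible flag is set OR the config
    // named by optedInBy (assumed "key=value") has the expected value.
    public static boolean isAllowed(Incompatible level, boolean allowIncompatible,
                                    Map<String, String> conf) {
        boolean optedIn = level.optedInBy.map(kv -> {
            int eq = kv.indexOf('=');
            return eq > 0 && kv.substring(eq + 1).equals(conf.get(kv.substring(0, eq)));
        }).orElse(false);
        return allowIncompatible || optedIn;
    }
}
```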
@andygrove
Member Author

@mbutrovich I pushed some config changes, inspired by our earlier discussions - let me know what you think of the direction

Default engine is now `java` (routes through the JVM UDF when
spark.comet.jvmUdf.enabled=true; falls back to Spark otherwise). Setting
engine=rust runs the native Rust regex engine and is itself the opt-in for
the semantic differences from Java regex — no separate allowIncompatible
flag for the regex family.

- Remove RegExp.isSupportedPattern (was a placeholder returning false)
- Replace per-serde engine checks with a single RegexpRoute helper
- Drop redundant *_rust_enabled.sql variants and migrate
  CometStringExpressionSuite split tests off the legacy per-class
  allowIncompatible flag
@andygrove
Member Author

> @mbutrovich I pushed some config changes, inspired by our earlier discussions - let me know what you think of the direction

I posted this comment prematurely - I still had local changes. They are pushed now.

andygrove added 4 commits May 20, 2026 07:59
Dual-impl regex serdes (rlike, regexp_replace, split) now return
Incompatible(notes, optedInBy="spark.comet.exec.regexp.engine=rust") for
the native rust path instead of Compatible. The standard QueryPlanSerde
gating then sees engine=rust as the opt-in via the optedInBy mechanism
introduced earlier, so the incompatibility is visible in EXPLAIN/logs
rather than hidden behind a routing-helper short-circuit.
Drop redundant interpolators in COMET_REGEXP_ENGINE doc string and remove
the redundant CometConf self-import in CometStringExpressionSuite to
satisfy scalafix. Switch existing rlike/regexp_replace tests to opt in via
COMET_REGEXP_ENGINE=rust now that the engine selector is the gate for the
Rust path, and reformat regex.md via prettier.
Resolve conflicts in pr_build_{linux,macos}.yml by integrating both the
new codegen-suite additions from main and the CometRegExpJvmSuite from
the PR, dropping the obsolete standalone "sql" matrix entry that main
folded into the "spark" matrix.

Resolve CometConf.scala by retaining both COMET_JVM_UDF_ENABLED /
COMET_REGEXP_ENGINE from the PR and COMET_SCALA_UDF_CODEGEN_ENABLED
from main. The follow-up refactor drops COMET_JVM_UDF_ENABLED in favor
of COMET_SCALA_UDF_CODEGEN_ENABLED.
…of hand-written UDFs

Replace the six hand-written `RegExp*UDF` / `StringSplitUDF` JVM UDF
implementations with the Arrow-direct codegen dispatcher introduced in
PR apache#4417 (`CometScalaUDF.emitJvmCodegenDispatch`). The dispatcher
Janino-compiles Spark's own `doGenCode` for the expression, so the
regex family inherits Spark-identical semantics with no per-expression
glue code.

Changes:

- Delete `spark/src/main/scala/org/apache/comet/udf/RegExp*UDF.scala`
  and `StringSplitUDF.scala`. Their behavior is now provided by
  Spark's `doGenCode` running inside the dispatcher.
- Rewrite the regex serdes in `strings.scala`. Expressions with no
  native Rust path (`RegExpExtract`, `RegExpExtractAll`, `RegExpInStr`)
  share a new `CometRegexpCodegenOnly` base; expressions with a native
  path (`RLike`, `RegExpReplace`, `StringSplit`) keep an explicit
  route table where the JVM arm now delegates to
  `CometScalaUDF.emitJvmCodegenDispatch`.
- Drop the `spark.comet.jvmUdf.enabled` config. The codegen dispatcher
  already has its own master switch
  (`spark.comet.exec.scalaUDF.codegen.enabled`); gating the regex
  family on the same flag avoids two flags for the same path.
  `spark.comet.exec.regexp.engine` keeps the `java`/`rust` selector
  semantics, and `engine=java` now requires the codegen flag.
- Revert the native Rust additions in `jvm_udf/mod.rs` and
  `jni-bridge/src/lib.rs`. The codegen dispatcher constructs Arrow
  output fields JVM-side via `CometBatchKernelCodegenOutput.toFfiArrowField`,
  so the list-vector field-name normalization cast is unnecessary.
- Update `CometRegExpJvmSuite`, `CometRegExpBenchmark`, the regex SQL
  test fixtures, and the regex compatibility doc to reflect the new
  gating.

Test plan:
- `CometRegExpJvmSuite`: 45/45 pass (covers all six regex expressions
  through the codegen dispatcher).
- `CometSqlFileTestSuite`: 289/289 pass.
- `CometStringExpressionSuite`: 33/33 pass.
- `CometCodegenSuite`: 60/60 pass.
- `cargo clippy --all-targets --workspace -- -D warnings`: clean.
@andygrove andygrove changed the title from "feat: add experimental support for Spark regexp expressions via JVM UDF framework" to "feat: experimental Spark regex support via codegen dispatcher" May 26, 2026
@andygrove
Member Author

@mbutrovich As discussed, I refactored this PR to use codegen dispatch.

The per-expression spark.comet.expression.regexp.allowIncompatible flag
is no longer consulted by the regex family. Switch to engine=rust so the
RLike serde reaches convertViaNativeRegex and emits the
'Only scalar regexp patterns are supported' fallback message the test
asserts on.
@mbutrovich mbutrovich self-requested a review May 28, 2026 17:43
Flip the default of spark.comet.exec.scalaUDF.codegen.enabled to true and
drop "experimental" language across the regex/codegen docs and CometConf
strings. With this default, the regex family (java engine path) and the
DateFormat dispatcher route through Comet's Arrow-direct codegen kernel
out of the box, so users see Comet acceleration for regex and complex
date formatting without per-conf opt-in.

The sentinel guard in CometSqlFileTestSuite still keys off the explicit
"=true" opt-in: most expression fixtures use their own native paths and
do not exercise the dispatcher, so we leave that scope unchanged.
@andygrove andygrove changed the title from "feat: experimental Spark regex support via codegen dispatcher" to "feat: Add 100% Spark-compatible regex support via codegen dispatcher" May 28, 2026
andygrove added 6 commits May 28, 2026 14:02
…nted regex

Previously engine=rust returned Spark fallback for regexp_extract,
regexp_extract_all, and regexp_instr because they have no native Rust
path. With the codegen dispatcher now enabled by default, prefer the
JVM dispatcher over Spark in that case so users still get Comet
acceleration with full Spark semantics. Fall back to Spark only when the
native path and the dispatcher are both unavailable.

Also document the per-expression
spark.comet.expression.<ClassName>.enabled disable knobs in the regex
compatibility guide, and add a regression test that exercises the new
rust→JVM fallthrough.
Flip spark.comet.exec.scalaUDF.codegen.enabled back to false and restore
the experimental, disabled-by-default language across the regex/codegen
docs and CometConf strings. With this default, the regex family (java
engine path) and the DateFormat dispatcher fall back to Spark unless the
user explicitly opts in.

This keeps the engine=rust JVM-dispatcher fallthrough behavior introduced
separately on this branch; only the codegen-enabled-by-default change is
reverted.
Remove the remaining experimental/disabled-by-default framing from
regex.md so the Java engine reads as a normal, supported regex engine
gated behind spark.comet.exec.scalaUDF.codegen.enabled.
